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Abstract 

The numerical and computational aspects of chiral fermions in lattice quantum chro- 
modynamics are extremely demanding. In the overlap framework, the computation 
of the fermion propagator leads to a nested iteration where the matrix vector mul- 
tiplications in each step of an outer iteration have to be accomplished by an inner 
iteration; the latter approximates the product of the sign function of the hermitian 
Wilson fermion matrix with a vector. 

In this paper we investigate aspects of this nested paradigm. We examine several 
Krylov subspace methods to be used as an outer iteration for both propagator 
computations and the Hybrid Monte-Carlo scheme. We establish criteria on the 
accuracy of the inner iteration which allow to preserve an a priori given precision for 
the overall computation. It will turn out that the accuracy of the sign function can 
be relaxed as the outer iteration proceeds. Furthermore, we consider preconditioning 
strategies, where the preconditioner is built upon an inaccurate approximation to 
the sign function. Relaxation combined with preconditioning allows for considerable 
savings in computational efforts up to a factor of 4 as our numerical experiments 
illustrate. We also discuss the possibility of projecting the squared overlap operator 
into one chiral sector. 
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1 Introduction 



For two decades numerical simulations of very light quarks within lattice quan- 
tum chromodynamics have remained intractable as the chiral symmetry of the 
underlying QCD Lagrangian, which holds in the case of zero mass quarks, 
could not be embedded into flavour conserving fermion lattice discretisation 
schemes. The standard workaround took recourse to simulations with fairly 
heavy quarks instead and extrapolated the results over a wide range of quark 
masses to the very light quark mass regime. Unfortunately, simulating far be- 
yond the realm of chiral perturbation theory such extrapolations carry large 
systematic errors which have to be avoided in order to achieve a sufficient 
precision of phenomenological observables [41]. 

It was realised by Hasenfratz some years ago [31] that considerable progress 
can be achieved in this bottleneck problem through switching to a discreti- 
sation scheme that obeys a lattice variant of chiral symmetry, as expressed 
by the Ginsparg- Wilson relation for the quark propagator [25] which in turn 
implies a novel version of chiral symmetry on the lattice [40]. Theoretically, 
such a scheme induces a dramatic reduction in fluctuations in the vicinity of 
quark mass zero. Shortly before the rediscovery of the Ginsparg- Wilson rela- 
tion, Neuberger had constructed the overlap operator [46,42], a very promising 
candidate for a chiral Dirac operator [49,47]. It implies the solution of linear 
systems involving the inverse matrix square root or the matrix sign function 
(of the hermitian Wilson-Dirac operator Q). This can be turned into an in- 
triguing practical method to simulate light quarks through iterative methods 
following an inner-outer paradigm: One performs an outer Krylov subspace 
method where each iteration requires the computation of a matrix- vector prod- 
uct involving sign(Q). Each such product is computed through another, inner, 
iteration using matrix-vector multiplications with Q. 

The problem of approximating the action of sign(Q) on a vector has been dealt 
with in a number of papers, using polynomial approximations [32,36,33,9,35], 
Lanczos based methods [5,6,3,56] and multi-shift CG combined with a partial 
fraction expansion [48,45,19,20]. In an earlier paper [53] we have introduced 
the Zolotarev partial fraction approximation (ZPFE) as the optimal approxi- 
mation to the matrix sign function. ZPFE has led to an improvement of about 
a factor of 3 compared to the Chebyshev polynomial approach [33] . This tech- 
nique to compute the sign function is meanwhile established as the method of 
choice, [24,17,15]. Moreover, it is the natural starting point for both the sim- 
ulation of dynamical overlap fermions [21,16] and so-called optimised domain 
wall fermions [12,11,13]. 

So far simulations with overlap fermions have been restricted to the quenched 
model, where fermion loops are neglected, because of the sheer costs of the 
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evaluation of the sign function on matrices with extremely high dimensions 
[33,38,27,26]. The challenge today is to step away from the quenched model 
and include dynamical fermions. At this point we have the unique opportunity 
to devise optimised simulation algorithms for overlap fermions, investigating 
novel numerical and stochastical techniques. 

Indeed, efficient methods to compute the sign function are only half of the 
story. It is equally important to design the entire nested iteration in an optimal 
manner. This means that we should care on how accurately we actually need 
the sign function to be computed in each step of the outer iteration process 
in order to achieve a given accuracy. As we will show, to achieve a given 
accuracy for the solution of the entire system, one can relax the accuracy 
of the computation of the sign function as the iteration proceeds. In this 
manner, the computational effort is reduced substantially. In addition to this 
approach, we will use the concept of recursive preconditioning of a Krylov 
subspace method to obtain further accelerations. 

In the present paper — which is part of a continuing series — we show that the 
use of relaxation strategies and recursive preconditioning in the linear sys- 
tem solver will substantially improve over existing methods, gaining a factor 
of 3 to 4 in computational speed in dynamical simulations on realistic lat- 
tices. Together with the improvement of ZPFE over Chebyshev polynomials, 
we therefore now have an improvement factor of about 10 over early overlap 
propagator computations [33]. These results are practical without any restric- 
tions, i.e., they rely on available computed quantities only. We do assume 
that there are computable error bounds for the approximation quality of the 
sign function. As was shown in our earlier paper [53], this is the case for the 
Zolotarev approach using multi-shift CG (Theorem 7 in [53]) as well as for 
a Lanczos based approach for Q 2 (Theorem 6 in [53]). All our results are 
obtained projecting out a number of low lying eigenvectors of the hermitian 
Wilson fermion operator. We briefly discuss the optimisation of the number 
of projected eigenvectors, taking into account the additional effort to generate 
these low lying modes by means of the Arnoldi algorithm. 

The paper is organised as follows: in Section 2 we briefly review results from [1] 
which relate different formulations of Neuberger's operator to optimal Krylov 
subspace methods for the solution of the corresponding linear systems. In 
Section 3 we apply the results from [54] to these methods and we obtain 
strategies on how to choose the accuracy for the inner iteration (evaluating 
the matrix vector multiplication sign(Q)y) at each step of the outer iteration. 

Section 4 presents further improvements based on the 'recursive' precondi- 
tioning technique, i.e., we use an inaccurate solver for the system as a precon- 
ditioner for each step of the outer iteration. As we will point out, recursive 
preconditioning might be considered a generalisation and improvement of ap- 
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proaches suggested by Giusti et al. [26] and Borigi [4]. For the purpose of 
illustration, Sections 3 and 4 will contain results from numerical calculations 
for a realistic, but small (4 4 ), example configuration. Results on more numer- 
ical experiments are given in Section 5 where we achieve improvement factors 
in a range from 3 to 4. 



2 Krylov subspace methods for the overlap operator 

2. 1 Notation and Basics 

The Wilson-Dirac fermion operator, 

M = I - kD w , 

represents a nearest neighbour coupling on a four-dimensional space-time lat- 
tice, where the 'hopping term' D\y is a non-normal sparse matrix, see (A.l) 
in the appendix. The coupling parameter k is a real number which defines the 
relative quark mass. 

The massless overlap operator (using the Wilson operator as a kernel) is de- 
fined as 

A) / -.!/• (MM) ~. 

For the massive overlap operator, for notational convenience, we use a mass 
parameter p > 1 such that this operator is given as 

D = pi + M ■ (M f M)-5, (1) 

with p > 1. How this form relates to Neuberger's choice and to the quark mass 
is explained in the appendix, (B.l). 

Expressing (1) in terms of the hermitian Wilson fermion matrix Q = 75 M, 
see (A. 2), the overlap operator can equivalently be written as 

D = pi + 7 5 sign(Q) = 75 • (p7 5 + sign(Q)), 

with 75 being defined in Appendix A and sign(Q) being the standard matrix 
sign function. Note that p7 5 + sign(Q) is hermitian, whereas 75 sign(Q) is 
unitary. To reflect these facts in our notation, we define 

D u = pi + 75 sign(Q), D h = fry 5 + sign(Q), 

where D u = j b D h . 
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In a simulation with dynamical fermions, the costly computational task is the 
inclusion of the fermionic part of the action into the 'force' evolving the gauge 
fields. This requires to solve linear systems of the form 

D\D u x = b <^> D\x = b. (2) 

From a practical point of view this means that we want to find an approximate 
solution x for (2) such that 

\\Dlx-b\\ 2 <0(e). (3) 

The value e is prescribed and depends on the accuracy of the overall process. 
In this paper we assume that this value is given. 

The major part of this paper is concerned with numerical methods for the 
above 'squared' equation, but we will occasionally also consider the equation 

D u x = b (4) 

which has to be solved when computing propagators. 

The standard solution method for solving the linear systems (2) or (4) is based 
on a nested iteration scheme. The outer iteration consists of an iterative linear 
system solver that invokes in every iteration step a vector iteration method 
for approximating the action of the matrix sign function to a vector. In the 
case of the squared system (2), this inner iteration must even be done twice. 

2.2 Adequate Krylov Methods 

In order to be self-contained, let us summarise results from [1], where Krylov 
subspace methods for the outer iteration are discussed in detail. 

Solving the propagator equation 

D u x = b (5) 
is equivalent to solving the symmetrised equation 

D h x = 7 5 & = b (6) 

or one of the normal equations 

D 2 h x = D h b, or D 2 h y = b,x = D h y. (7) 

Interestingly, for all these equations one has feasible optimal Krylov subspace 
methods at hand, i.e. methods, which rely on short recurrences and which 
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obtain iterates satisfying an optimality condition on the Krylov subspace gen- 
erated by the matrix of the respective equation: The normal equations (7) can 
be solved with the CG method (its iterates minimise the error in the energy 
norm), the symmetrised equation (6) can be solved via the MINRES method 
(its iterates minimise the residual in the 2-norm), and the shifted unitary sys- 
tem (5) can be solved with a less well known method of Jagels and Reichel 
[37], which we termed SUMR in [1] (its iterates have again minimal residual 
in the 2-norm). 

The theoretical results from [1], backed up by numerical experiments, show 
that solving (5) via SUMR is the best of all these methods, resulting in savings 
of up to 30% as compared to the other two approaches which both require 
approximately the same computational work. 

When it comes to solving the squared equation 

D 2 h x = b D\D u x = h (8) 

we have two basic options: Either solve (8) as it stands, using the CG method 
for the hermitian and positive definite matrix D\ = D^Du, or using a two 
pass strategy solving 

D h y = b, D h x = y, 

or 

D lv = fo > D u x = y. 

From the previous discussion it is immediately clear that the latter form of 
the two-pass strategy is to be preferred, and the results from [1] further show 
that solving (8) via CG is usually the best option. 



3 Strategies for the accuracy of the inner iteration 

In the first paper of this sequence [53], we discussed a posteriori error es- 
timators for various vector iteration methods that construct approximations 
from a Krylov subspace to the action of sign(Q) to a vector. This included 
Lanczos-type methods and computational schemes based on the multi-shift 
CG method. The control over the error of the matrix-vector products is very 
important in a two-level iteration scheme and in this section we discuss how to 
exploit this. For generality, we consider the solution of a generic linear system 

Ax = b, 

where A and b depend on the formulation used and A involves somehow the 
matrix sign function of Q. In step j + 1 of the Krylov subspace method we 
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have to compute an approximation s to the product of the matrix A times a 
vector, say y, as 

\\Ay-s\\ <ri r \\A\\ ■ \\y\\. (9) 

An obvious choice is to pick rjj fixed and equal to e, the overall accuracy in 
(3), in every iteration step. Since this can be seen as raising the unit roundoff 
to a level of e, we expect that (3) can be achieved in this case. However, better 
strategies for choosing rjj do exist. 

In the past few years, various researchers have investigated the effect of ap- 
proximately computed matrix-vector products on Krylov subspace methods. 
Outside the context of this paper, this plays a role in, for example, electromag- 
netic applications [10], the solution of Schur complement systems [8,52,55] and 
eigenvalue problems [29]. This work has led to, so-called, 'relaxation strate- 
gies' for choosing the rjj, starting with the empirical results in [7,8] and later 
followed by the more theoretical papers [54] and [52]. The goal of these relax- 
ation strategies is, given a required residual precision of order e (similar as in 
(3)), to minimise the total amount of work that is spent in the computation 
of the matrix-vector products. It turns out that accurate approximations to 
the matrix-vector product are required in the very first iteration steps, but 
this precision can be relaxed as the methods proceed (which explains the term 
relaxation). In this section we summarise the main conclusions which are of 
interest to nested iterations for the QCD overlap formulation. 

In a Krylov subspace method, for example the CG method, in every iteration 
step an approximation to the residual, r k , and an iterate x k are computed, 
also when the matrix vector product is not exact. Unfortunately, from the 
very first iteration on, due to the approximate matrix-vector products, the 
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true residual, b — Ax k , and the computed approximation to the residual, r 
drift apart. Therefore, the vector r k is not a good estimator for the quality of 
the computed iterate. The approach taken in [54] is to consider the inequality 

\\b-Ax k \\ < || r k - (b - Ax k ) || + || r k ||. 

S v ' V y > 

true residual residual gap computed residual 

The computed residual norms ||r fc || can be monitored during the iteration 
process and there is overwhelming numerical - and partially theoretical - 
evidence that the computed residuals initially decrease and stagnate in the end 
at a level smaller than the size of the unknown residual gap. From a practical 
point of view, this means that strategies for controlling the error of the matrix 
sign function can be derived by bounding the size of the gap in terms of the rjj 
and subsequently choosing the rjj such that the size of the residual gap does 
not become larger than the order of e. This approach is taken in [54,55] and 
it confirms and leads to improvements upon the empirically found strategies 
proposed by Bouras et al. [7,8]. For clarity we discuss this in more detail for 
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the CG method where the matrix A is hermitian positive definite. 

In the CG method the iterate and residual are updated using the formula 

r i — r j-i _ aj-iqi -1 , x j = x^~ l + ccj-ip 7-1 

with 



J— 1 II 2 



\W^-Ap>^\\<r, i .^\\A\\^\\p'^\\ and aj _ 1= 1 

qj-u ■ p3-i 

A simple inductive argument shows that 

k— 1 

\\r k -(b-Ax k )\\<Y,vM 3 \-\\A\\-\\^\\. 

3=0 

To continue we need to bound \aij\ ■ \\p*\\ and to keep our discussion pertinent 
we will start by considering the size of these quantities in case of exact matrix- 
vector multiplications. From the definition of a,-, we have 

Mil 2 |H| 2 

M - m= wir\-m- (10) 

It is straightforward to bound the first term in (10). Using qi = Ap* , we see 
that with an exact multiplication this term is smaller than Further- 
more, using the recursion of the conjugate search directions 

pj _ r i _ ^■/^■_ 1 -pP~ l and 7j = 1 1 r J 1 1 , 
it follows by exploiting orthogonality properties that 



Up* II 



2 • 



|| r i ||-2 



\ i=0 



As is explained in [54] (giving the details would be beyond the scope of this 
paper), it is reasonable to assume that in the case of an inexact matrix- vector 
product there is a modest constant c such that right hand side of (10) times 
c is an upper bound for \aj\ ■ \\p*\\- 

With this assumption we have that the residual gap after k steps is bounded 
as 

\\r k - (b - Ax k )\\ <c\\A\\. \\A- l \\ p 3 = (£ \\rT 2 r 1/2 - 

j=0 i=0 

Since, we are interested in a final residual precision of about e our strategy 
is to keep the residual gap of this size. Hence, we propose, following [54,55] 
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Algorithm 1 RelCG(A, b, e) 

{computes x with \\Ax — b\\ < e • ||6|| via relaxed CG} 

x = 0; {initial value} 

r = b; 
p = r; 

laid = 7 = r 1 " • r; 

C = 1/7; 

while yfy > e • ||6|| do 

compute q with — g|| < e • ||6|| • ||p|| ■ a/0 
P = q*-p; 

a = 7//3; 
x = x + a • p; 
r = r — a ■ q; 
7 = r ' • r; 

C = C + i/7; 

laid = 7; 

end while 



Fig. 1. generic relaxed CG 
which improved upon the empirical strategy from [8], 

Vj = f , (11) 

which guarantees that 

||r fe - (6- Ar fc )|| < c • k ■ \\A\\ WA' 1 \\e. (12) 

For the purpose of illustration, Figure 1 gives an algorithmic description of 
the CG method with this strategy for tuning the errors in the matrix-vector 
products. 1 

If we assume that the computed residuals r- 7 decrease and we terminate in step 
k where ||r fc || is smaller than e then the size of the true residual ||6 — Ax k \\ is 
bounded by (12) plus e. This shows that we have achieved our accuracy goal 
despite the fact that we work with less accurate matrix- vector products as the 
iterative process proceeds. Notice that we have not given a priori guarantees 
that the computed residuals become smaller than e (with a comparable speed 
to the exact process) and, furthermore, that c is not a very large constant. 
Unfortunately, we are not aware of the existence of rigorous estimates for these 

1 Matlab code for all methods presented in this paper is publicly available through 
the Internet at www.uni-wuppertal.de/org/SciComp/preprints 
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shifted unitary 
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Vj = e/lM| 




(D u , equation (5)) 









Table 1 

Advised Krylov subspace method and corresponding strategy for tuning the pre- 
cision of the matrix-vector products as a function of the properties of the matrix 
A. 

quantities in the context of relaxation (there are results for the case that rjj is 
constant, see [30]). However, numerous numerical experiments show that this 
is not an issue in practice and the advantage of this way of deriving strategies 
for picking rjj is that these quantities can be monitored at little additional 
cost, if necessary. 

One issue remains if we want to apply the discussed relaxation strategy for 
solving (8). In this case we must be able to assess the accuracy of our computed 
approximation s to D\y through the accuracy in computing the action of 
sign(Q) on a vector, see (9). We do so by expanding D\ as 

D\ = (p 2 + 1)/ + p 7 5 sign(Q) + psign(Q) 75 , 

so that we achieve {{D^y — s\\ < r]\\y\\ by requiring that the approximations Si 
to sign(Q)y and s 2 to sign(Q)(>y 5 y) fulfil 

|| sign(<2)y - si|| < — r]\\y\\ and || sign(Q)(7 5 y) - s 2 \\ < — r)\\y\\. 



3.1 Relaxation strategies for SUMR and MINRES 

So far, we have discussed relaxation for the CG method since this is fairly 
straightforward due to its two-term recurrences. In [54] a general framework 
is given that allows an analysis for a large variety of Krylov subspace meth- 
ods. We refer the reader to this paper for more information. In general, the 
relaxation strategies proposed there for these methods guarantee bounds on 
the residual gap of the form (12). In Table 1 we have summarised the strate- 
gies for choosing the r]j for the Krylov subspace methods relevant for the 
formulations in the previous section. 
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0.5 1 1.5 2 1 1.5 2 2.5 3 

real part real part 




Fig. 3. Relaxed and non-relaxed CG for (8). Left: a = 0.3. Right: ^ = 0.1. 

For the SUMR method mentioned in the previous section, no relaxation strate- 
gies have been proposed so far. Unfortunately, an analysis of SUMR with ap- 
proximate matrix-vector products is more involved than for the other iterative 
solvers since no residuals are computed during the iterative process (only their 
lengths are available). However, for theoretical purposes we can introduce an 
additional recursion for the residual vectors and consider the residual gap. It 
is then possible to show that the same results apply as for the full GMRES 
method in [54, Section 7]. Without going into details, let us just state that 
therefore we expect good results for SUMR using the same strategy as used 
for the GMRES method, see Table 1. 



3.2 Numerical illustration 



To illustrate the effects of relaxation, we here report on numerical experiments 
for a simple test situation. More extensive experiments will be reported in 
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Section 5. We use the 4 4 example configuration Confl from Section 5, which 
is the result of a dynamical simulation at /3 — 5.4. The hopping parameter in 
the Wilson fermion matrix was taken as k = 0.2. The spectrum of the Wilson 
fermion matrix M is given in the left part of Figure 2, the spectrum of the 
overlap operator D u is plotted on the right. We solved the 'squared' equation 
(8) using the CG method for the mass parameters \i = 0.3 and /i — 0.1, where 
p=(l + /x)/(l-//),see(B.l). 

The plots in Figure 3 give the norm of the (computed) residual as a function 
of the number of matrix-vector multiplies (MVMs) with Q. These MVMs 
all occur in the multi-shift CG method when approximating sign(Q)x via 
the Zolotarev approach. Details of our implementation are given in Section 5. 
Each plot contains two convergence curves, one for CG without relaxation, i.e., 
with a fixed precision for the MVMs, and one with the relaxed CG method 
described in Algorithm I. 

In the relaxed CG methods we also used a high accuracy inner iteration for 
computing the sign function to compare the true and the computed residuals. 
The true residuals are plotted in Figure 3 as dotted lines. We see that the 
true and the computed residuals are virtually the same until they are down 
to e = 10~ 6 , the required accuracy, which was the parameter used in (11). 
We see that the relaxation strategies yield an improvement in the order of 
20%, regardless of the value of //. This improvement is larger in the more 
realistic computations to be reported in Section 5. There we perform an addi- 
tional eigenvalue projection step to speed up the overall computation. In this 
situation relaxation leads to larger gains ranging from 30% to 40%. 



4 Further improvements: Recursive preconditioning 

In the previous section we discussed strategies for controlling the error of the 
matrix-vector products. The preliminary numerical experiments there showed 
that a reduction of at least 20% can be expected compared to the case of 
using a fixed precision for the matrix-vector products in all steps. Two impor- 
tant practical observations should be made, see also [55, Section 3]. First, we 
note that, if the number of iterations to reach the desired residual reduction 
is large, then there can be a considerable accumulation of the errors in the 
matrix-vector product in the residual gap. This is reflected for the CG method 
by the fact that the required number of iterations appears in the upper bound 
on the residual gap in (12) as a factor k. In practice, this might mean that the 
tolerance on the matrix-vectors has to be decreased, which is the tactic taken 
in [52]. But more importantly, the strategies discussed in the previous section, 
take the error in the matrix- vector products, essentially, inversely proportional 
to ||r J ||. Krylov subspace methods often show superlinear convergence, mean- 



12 



Outer iteration 

GMRESR 



preconditioner 



Preconditioning iteration 




inner iteration 


Krylov subspace method 


MVM 


multishift CG for 


Zolotarev approximation 



Fig. 4. Overview of our recursive preconditioning computational scheme 



ing that the convergence speed increases as the iterative method proceeds. 
Hence, the number of inexpensive approximations to the matrix-vector prod- 
ucts is relatively small and this limits the maximal gain that can be achieved 
with the relaxation approach. 

Stating the same observation from a different point of view, we conclude that 
the relaxation strategy should work particularly well when the convergence of 
the iteration is fast (but linear) from the very beginning. In order to achieve 
this, we investigate the idea of 'preconditioning' the Krylov subspace method 
by another (inexact) Krylov subspace method set to a larger tolerance of £j 
in step j + 1. We refer to this as recursive preconditioning. To stay consistent 
with the terminology used so far, we refer to the inexact Krylov method and 
its variable preconditioner as the outer iteration and preconditioning iteration 
respectively, reserving the term inner iteration to the method which approxi- 
mates the matrix vector product. The inner iteration is thus used in both, the 
preconditioning and the outer iteration, see Figure 4. 

Methods that can be used for the outer iteration are the so-called flexible 
methods. These are methods that are specially designed for dealing with vari- 
able preconditioning, e.g., [28,51,57]. In [55] these methods were combined 
with approximate matrix-vector products. For our numerical experiments we 
chose the GMRESR (GMRES recursive, [57]) method as the outer iteration. 
Note that other choices like flexible GMRES [51] are an equally good option. 
We have chosen GMRESR here since it is slightly more straightforward to 
implement. The paper [55] analyses various choices for the accuracies r]j and 
C,j and shows that rjj = e/||r J || and £j = £ fixed are good choices. Figure 5 
gives a Matlab-style algorithmic description of the overall method, which we 
call relaxed GMRESR. To stress the preconditioning iteration, we sometimes 
add it in parenthesis, so that, e.g., relaxed GMRESR(CG) means that we use 
the CG method as the preconditioning iteration. 

The recursive application of iterative solution methods is often encountered 
in scientific computing applications. For example, van der Vorst and Vuik 





inner iteration 


MVM 


multishift CG for 


Zolotarev approximation 
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Algorithm 2 relGMRESR(A, b, e) 

{computes x with \\Ax - b\\ < e • ||6|| via relaxed GMRESR} 

x = 0; {initial value} 

r = b; 

C — []; {empty matrix} 
[/=[]; {empty matrix} 
while ||r|| > e ■ ||6|| do 

so/i>e A-u = r to relative accuracy £ (/or example u = relCG(A, r, 
{preconditioner} 

compute c with ||Au — c|| < e • ||6|| • ||w||/||r||; 

for i—l:size(C,2) do 

c = c — /3 • C[:, i\; 

u = u — [3 ■ U[:, i]] 
end for 
c = c/\\c\\; 
u = u/\\c\\; 
C = [C,c}- 
U = [U,u}; 
a = • r; 
x = x + a ■ u; 
r = r — a ■ c; 
end while 



Fig. 5. Relaxed GMRESR 

notice in [57] that preconditioning GMRES with a fixed number of iterations 
of GMRES can give a considerable improvement over restarted GMRES. This 
explains the name 'GMRESR' which we keep in this paper although we use 
other choices for the preconditioner. 

In the context of approximate matrix-vector products, nested iterations have 
been used by Carpentieri in his PhD thesis [10]. He uses flexible GMRES in 
the outer iteration and GMRES in the inner iteration for an application from 
electromagnetics where the matrix-vector products are approximated using a 
fast multipole technique set to a fixed precision. The paper [55] shows nu- 
merical experiments for a Schur complement system that stems from a model 
of global ocean circulation. Using the relaxed preconditioned approach one 
gets a significant reduction in the amount of work spent in the matrix-vector 
products. 

A related idea for the QCD overlap formulation has recently been advanced by 
Giusti et al. [26, Section 9] in a method which they call an adapted-precision 
inversion algorithm. Their scheme corresponds to the approach presented here 
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if, instead of GMRESR, one takes a simple Richardson iteration as the outer 
iteration and if, in addition, the residuals are computed directly. The authors 
of [26] do not discuss specific choices for the precision of the matrix-vector 
product in the outer iteration and use a fixed precision in the precondition- 
ing iteration. Our more general approach allows the use of more sophisticated 
outer iterations like GMRESR and, moreover, gives a specific and computa- 
tionally feasible strategy on how to choose the precision of the inner iteration. 

It is also interesting to mention that the idea of adapted precision inversion is 
related to an earlier approach of Hernandez et al. in [33] and Borigi in [4]. 



5 Numerical experiments 

Next we present numerical experiments carried out in a realistic setting. To 
this purpose we have developed a Hybrid Monte-Carlo program (HMC) with 
either one or two flavours of dynamical overlap fermions based on the Zolotarev 
partial fraction expansion [53]. Details as to the construction of the overlap 
fermion force within the HMC can be found in Appendix D. 

We have generated decorrelated configurations with one flavour (see Sec- 
tion 5.3 below) of dynamical overlap fermions on an 8 4 -lattice at f3 — 5.6, 
and with two flavours on a 4 4 lattice at f3 — 5.4. For our experiments the 
Wilson kernel mass parameter has been adjusted to k — 0.2. It is known that 
the locality properties and spectral density of the overlap operator depend 
strongly on the value of k used, with k ~ 0.2 being the optimum value, at 
least in the quenched theory on large lattices [34]. 

We have used a mass parameter p [44], which, according to (B.l), is equivalent 
to p = (1 + p)/(l — p), and we have chosen p = 0.1 (p = 1.22), and p = 0.3 
(p = 1.857). These values of p are similar to the smallest non-zero eigenvalues 
of the overlap operator and, given our small lattices, there will be little change 
in the results when moving p to smaller valence mass. 

Our results will be given for five configurations for the 4 4 lattice volumes, 
separated by 50 HMC sweeps, and on three plus five configurations, separated 
by 20 HMC sweeps on the 8 4 lattice. Additionally, some computations were 
performed on a configuration from a quenched {(5 = 6.0) ensemble, with the 
inversions performed at k = 0.2, with three values of p, 0.3, 0.1 and 0.03. 

All our computations were carried out on the Wuppertal cluster computer 
ALICE, using 16 processors for the calculations on the 8 4 and 16 4 lattices, 
and one processor for the 4 4 calculations. 
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5.1 Projecting out low-lying eigenvectors 



Let a and b denote the smallest and largest eigenvalue of Q 2 , respectively. 
Then, in principle, we need the Zolotarev approximation to the sign function 
for the domain [— y/b, — y/a\U[y/a, Vb]- As a gets smaller, we need an increasing 
number of terms in the Zolotarev approximation in order to obtain a given 
accuracy. It is possible [18] to accelerate the calculation of the sign function 
by calculating the n p 'smallest' eigenvectors of the Wilson operator, and by 
treating them exactly. The Zolotarev approximation is then only needed on 
a domain [—Vb, — Va 7 ] U [Va 7 , y/b] with a' being not larger than the n p + 1 st 
smallest eigenvalue of Q 2 . 

Besides from allowing us to shrink the domain over which we need an accurate 
approximation to the sign function, projecting out small eigenvectors also 
improves the condition number of the Wilson operator Q. As a consequence 
the multi-mass inversion to be performed for the Zolotarev approximation 
converges faster. 

There are two different ways in which we can project out the eigenvalues - 
either out of the sign function, or out of the multi-mass solver for the partial 
fraction expansion. Our preference is to fix a' at some suitable value, and to 
project the eigenvalues below Vol directly out of the sign function, and those 
eigenvalues above Vol out of the multi-mass solver [16] (see Appendix C). 

We calculate the coefficients of the Zolotarev expansion so that the sign func- 
tion is approximated to machine precision within the range [Vol, Vb] (see Ap- 
pendix D). We cannot vary the values of a' and b in the outer iteration across 
the trajectory of the HMC algorithm without violating detailed balance. Some 
fine tuning of the optimal value of a' is possible, but this optimisation lies out- 
side the scope of this paper: here we keep b fixed at 10, and a' fixed at 10~ 5 . For 
all the calculations described in this section, we took n p = 28, which always 
resulted in a n p + l st smallest eigenvalue larger than a' = 10~ 5 , as required. The 
performance gain obtained from the eigenvalue projection is briefly described 
in Appendix C. 

In the preconditioner, b and a' were allowed to vary. We took b as the largest 
eigenvalue of Q 2 (usually around 5), while a' was the n p +l st smallest eigenvalue 
(around 10~ 3 for the 4 4 lattices, and 10 -4 for the 8 4 lattices). Since we need 
the sign function less precise in the preconditioner, the number of poles in the 
Zolotarev approximation could be taken quite small (see Appendix D), thus 
minimising the computational effort in the preconditioner. 



16 




First 4 4 , fj, = 0.3 configu- Third 8 4 , fj, = 0.1 config- 

ration from table 2 uration from table 3 



Fig. 6. Convergence history for unrelaxed CG, relaxed CG and relaxed GMRESR 
for one inversion of the squared equation (2). We plot the norm of the residual 
vs. the number of calls to the Wilson operator Q. The tics indicate each (outer) 
iteration: On the 4 4 lattice relaxed GMRESR needs 5 iterations to converge, whereas 
unrelaxed and relaxed CG both require 11 iterations. 

5.2 Results for the squared overlap operator 

Figure 6 gives the convergence history for the third configuration on the 8 4 
and the first on the 4 4 lattice. The plots show the residual norm against the 
numbers of MVMs with the Wilson fermion matrix Q. Since the latter dom- 
inate the performance of the entire process, we can consider them as a first 
approximation of the overall execution time. The figure compares the unre- 
laxed CG, relaxed CG and the relaxed GMRESR(CG) methods. In addition 
to the plots, we give the actual timings for our implementations in Tables 
2-3. Note that for the preconditioned iterations the gain in time is more than 
the gain obtained in MVMs with Q. The reason is that in the preconditioner 
we only have five different shifts to work on when performing the multi-mass 
inversion for the Zolotarev approximation, as opposed to the 25 shifts to be 
used in the outer iteration. 

Comparing the times from the tables we see that on the 8 4 lattices relaxation 
already reduces the computational effort by a factor of 1.7. Additional pre- 
conditioning further reduced the effort by an additional factor of 2.2 (taking 
£j = £ = 0.1 independently of j since there was little change in the gain for 
0.01 < < 0.3). 

The gain turned out to be smaller on the 4 4 ensembles, with only about 
a factor of 1.5 gain for the relaxation, and an additional factor of 1.3 for 
the preconditioner. The gain on the 16 4 quenched configuration is similar to 
the gain for the 8 4 ensembles. The lower gain for the preconditioner on the 
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Method Confl Conf 2 Conf 3 Conf 4 Conf 5 



CG 


53 


55 


53 


57 


55 


relCG 


34(1.56) 


35(1.57) 


36(1.47) 


37(1.54) 


38(1.45) 


relGMRESR(CG) 


19(2.78) 


20(2.75) 


24(2.21) 


25(2.28) 


26(2.11) 






H = 0.3 








Method 


Conf 1 


Conf 2 


Conf 3 


Conf 4 


Conf 5 


CG 


50 


46 


44 


46 


48 


relCG 


33(1.52) 


31(1.48) 


30(1.47) 


32(1.44) 


31(1.55) 


relGMRESR(CG) 


22(2.27) 


20(2.30) 


23(1.91) 


25(1.84) 


21(2.29) 



At = 0.1 

Table 2 

Times (in seconds) for one inversion on the five 4 4 configurations with (3 = 5.4, run 
on 1 processor of ALiCE. The number in brackets is the gain from the unrelaxed 
and unpreconditioned (CG) inversion. 

smaller lattices is due to the fact that the CG inversion on 4 4 lattices converges 
already quite fast, a consequence of the particular eigenvalue distribution of 
the 4 4 lattice already observed in Figure 2. Thus, the 4 4 inversion spent more 
time in the outer GMRESR algorithm (compared to the time spent in the 
CG preconditioner) than the 8 4 inversion did. Since one sweep through the 
GMRESR algorithm takes considerably longer than one sweep through the CG 
preconditioner, this reduced the gain of the preconditioning on the smallest 
lattices. For the same reason, the gain achieved by preconditioning is larger 
as we decrease the overlap mass, especially on the larger lattices, where the 
overlap operator is less well conditioned. For example, on the 16 4 quenched 
configuration, the gain for using the preconditioner is 1.75 times larger for 
pL = 0.03 than for \x = 0.3. 

5.3 Chiral projection. 

Based on investigations of the Schwinger model, the authors of [2] suggested 
that it might be beneficial to project the squared overlap operator to one chiral 
sector. We have 

Dl = Dl + Dl; 
Dl = l -D 2 h (I ± 75) 

= (/ ± Ts) ± \ (I ± 75) sign(Q) (/ ± 75) • 
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Method Confl Conf 2 Conf 3 Conf 4 Conf 5 



CG 


1419 


1139 


1216 


1307 


1305 


relCG 


754(1.88) 


697(1.63) 


737(1.65) 


816(1.60) 


767(1.70) 


relGMRESR(CG) 


319(4.45) 


301(3.78) 


315(3.86) 


364(3.59) 


341(3.82) 






H = 0.3 








Method 




Confl 


Conf 2 


Conf 3 




CG 




1965 


2052 


2039 




relCG 




1202(1.63) 


1250(1.64) 


1234(1.65) 




relGMRESR(CG) 


614(3.20) 


567(3.61) 


547(3.72) 





p = 0.1 

Table 3 

Times (in seconds) for one inversion on the three 8 4 configurations with (3 = 5.6, 
run on 16 processors of ALICE. 



Method 


fi = 0.03 


n = 0.1 


p = 0.3 


CG 


31430 


9022 


3493 


relCG 


18813(1.67) 


5981(1.51) 


2610(1.34) 


relGMRESR(CG) 


6642(4.73) 


2329(3.87) 


1286(2.71) 



Table 4 

Times (in seconds) for one inversion on the quenched 16 4 configuration at (3 = 6.0, 
run on 16 processors of ALICE. 

Because [-£^,75] = 0, the eigenvalues 2 , (\j±) 2 of D\ are the same, except for 
the zero modes and their partners. If there are no zero modes then 

det D u = det D 2 + = det D 2 _ , 

since the non-zero eigenvalues of D u are 



Zero modes can be treated exactly at the end of the simulation by re-weighting 
the observables according to (2(1 — p)/(l + p))' < ^ / ' Ar - f , where Qf is the fermionic 
topological charge 3 . This means that we can run Nf = 1 simulations by 
projecting <p into one chiral sector and running the HMC with D\ rather than 



2 For a more detailed discussion of the eigenvalue spectrum of D u and see [1] . 

3 An alternative is to introduce a second Metropolis step after a certain number of 
trajectories 
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Method 



Conf 1 



Conf 2 



Conf 3 



Conf 4 



Conf 5 



CG 



631 



564 



623 



635 



641 



relCG 



399(1.58) 362(1.56) 399(1.56) 413(1.54) 398(1.61) 



relGMRESR(CG) 190(3.32) 171(3.30) 180(3.46) 199(3.19) 181(3.54) 



Times (in seconds) for the inversion of chiral fermions on the 8 4 , /j, = 0.3 ensemble, 
run on 16 processors of ALiCE 

D\. The calculation of D\ only requires one call to the sign function rather 
than two, so in principle working in one chiral sector should run an Nf = 1- 
simulation in half the time it takes to run an Nf = 2-simulation without the 
projection. 

There are two advantages in using the chiral projection. Firstly, we can work 
in the topological sector that contains no zero-modes, which means that the 
inverse of the Dirac operator exists even at p — 1 (// = 0). The convergence of 
the inversion should be improved when the fermion mass is of the same size as 
the smallest non-zero eigenvalues. It is however unlikely that on more realistic 
lattice sizes that we will be able to run at small enough masses to see such an 
effect. 

Secondly, and more importantly, it should allow more frequent changes of the 
topological charge in the chiral sector opposite to the one we are working in, 
which means that better samples in configuration space and reduction of the 
autocorrelation time can be achieved. This effect was seen in the Schwinger 
model [2], and our early results suggest that it is present in four dimensions as 
well. In fact, HMC runs without chiral projection are very resistant to changes 
in the topological charge. 

However, there is also a disadvantage with this method. At low fermion masses, 
the Qf — configurations will dominate the statistical average. As a conse- 
quence Qf 7^ configurations will turn out to be relatively unimportant. 
Whether these disadvantages outweigh the advantages is an open question. 
However, even if it is not advantageous to use chiral projection for the up and 
down quark contributions to the determinant, it would certainly be a useful 
tool if we wish to include a dynamical strange quark in a Nf = 3 simulation. 

Table 5 summarises results for these chiral projection computations with D\ 
on the 8 4 configurations with /i — 0.1. Since D\ is hermitian positive definite, 
the unrelaxed and relaxed computations were done with CG. The precon- 
ditioned variant took relaxed GMRESR as the outer iteration with the CG 
iteration being the preconditioner. There is, as expected, an overall gain of a 
factor of approximately 2 for the chirally projected inversion, which is due to 
the fact that we only need to call the sign function once for each application 



Table 5 
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first 4 4 ,/i = 0.3 configu- third 8 4 ,/i = 0.1 configu- 

ration from Table 2 ration from Table 3 



Fig. 7. Convergence history for unrelaxed SUMR, relaxed SUMR and relaxed GM- 
RESR(SUMR), two pass strategy for (2) 

of the squared overlap operator. This gain will not appear in any HMC sim- 
ulation, since we need to run two such inversions to get a Np = 2 ensemble. 
However, single-flavour simulations come for free in this framework. Again, 
the gain from relaxation and preconditioning is between 3 and 4, although 
the gain from relaxation and preconditioning is slightly smaller when chiral 
projection is used (compare Table 5 and 3). 



5.4 The two pass SUMR inversion and propagator calculations. 

In a last series of investigations, we performed two pass computations (see Sec- 
tion 2.2) on the 8 4 and 4 4 configurations. We now compare unrelaxed SUMR, 
relaxed SUMR and relaxed GMRESR(SUMR), i.e. the preconditioning is done 
with SUMR. As before, we required a fixed accuracy £j = £ = 0.1 in the pre- 
conditioning iteration. As can be seen in Figure 7 and Tables 6 to 8, there 
is again an improvement of around a factor of 3 to 4 when relaxation and 
preconditioning are used, and again this factor increases as we decrease /i, 
especially on the larger lattices. In [1], we predicted that two passes through 
SUMR should theoretically take roughly the same time as one pass through 
CG, although in numerical tests we found that the two pass method was 
slightly slower. Comparing Tables 6 to 8 and Tables 2 to 3, we can see that 
as we move to larger lattices the unrelaxed two pass strategies do take ap- 
proximately the same time as the single CG inversion. However, the gain in 
using the preconditioner is generally larger for the SUMR inversion than for 
the CG inversion. If this trend continues, then as we move to larger dynamical 
simulations, it might be preferable to use the two pass strategy to calculate 
the HMC fermionic force. Indeed, on our 16 4 quenched configuration at the 
lower masses, two SUMR inversions already improve upon one CG inversion 
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Method 


Conf 1 


Conf 2 


Conf 3 


Conf 4 


Conf 5 


SUMR 


68 


59 


59 


66 


63 


relo U Mil 


AO/I A o^ 

4o(1.4z ) 


A 1 /l A A \ 

41(1.44 J 


A 1 ( 1 A A \ 

41(1.44 ) 


A Kf1 A1\ 
40(1.4 / J 


A A ( 1 /I Q\ 

44(1. 4o ) 


reiLjMJXiijojt^o u ivixt j 


Zo[Z. ( Z) 


OAfO Afii\ 
Z4(Z.40 j 


zo(z.o / ) 


oZ[Z. UD J 


Z\)\Z.i ( ) 


Method 


Conf 1 


H = 0.3 
Conf 2 


Conf 3 


Conf 4 


Conf 5 


SUMR 


88 


87 


89 


89 


94 


relSUMR 


60(1.47) 


61(1.43) 


59(1.51) 


64(1.39) 


66(1.42) 


relGMRESR(SUMR) 


23(3.82) 


21(4.14) 


28(3.18) 


30(2.97) 


34(2.76) 


n = 0.1 



Table 6 

Times (in seconds) for two SUMR inversions on the five 4 4 configurations at (3 = 5.4, 
run on one processor of ALICE. 



Method Confl Conf 2 Conf 3 Conf 4 Conf 5 



SUMR 


1538 


1181 


1222 


1288 


1286 


relSUMR 


804(1.91) 


748(1.58) 


795(1.54) 


818(1.57) 


796(1.62) 


relGMRESR(SUMR) 


383(4.02) 


351(3.36) 


383(3.19) 


392(3.28) 


382(3.37) 






H = 0.3 








Method 




Confl 


Conf 2 


Conf 3 




SUMR 




2272 


2685 


2510 




relSUMR 




1695(1.34) 


1661(1.62) 


1500(1.67) 




relGMRESR(SUMR) 


674(3.38) 


650(4.13) 


576(4.36) 





n = 0.1 



Table 7 

Times (in seconds) for two SUMR inversions on the 8 4 configurations at (3 = 5.6, 
run on 16 processors of ALiCE. 



when we use the preconditioning technique. 

To calculate the propagator we need to perform a single inversion of the overlap 
operator using SUMR. The time needed for this calculation will be half the 
time for the two pass inversions described in this section. 
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Method 



H = 0.03 



H = 0.1 



H = 0.3 



SUMR 31550 8312 3200 

relSUMR 18840(1.87) 6038(1.38) 2656(1.20) 

relGMRESR(SUMR) 5974(5.82) 2252(3.69) 1382(2.32) 

Table 8 

Times (in seconds) for two SUMR inversions on the quenched 16 4 configuration, 
run on 16 processors of ALiCE. 



6 Discussion 



Ginsparg- Wilson fermions, such as overlap fermions, offer an intriguing pos- 
sibility to overcome the bottleneck which affects dynamical simulations with 
Wilson fermions at light quark masses. The lattice chiral symmetry satisfied 
by the Ginsparg- Wilson fermions will also enable us to study aspects of QCD 
such as chiral symmetry breaking and topology better than it is possible with 
Wilson fermions. 

The bottleneck of dynamical simulations with Ginsparg- Wilson fermions is the 
computational time. In order to invert the overlap operator in the course of 
the Monte-Carlo simulation, we need to run a nested inversion, which means 
that overlap fermions are of the order (9(100) times as expensive as Wilson 
fermions. The hope is that by improving algorithmic techniques we can bring 
the computational cost down to something manageable. In this paper, we have 
studied two such techniques - relaxation of the accuracy of the inner inversion, 
and using a low accuracy approximation of the sign function in a precondi- 
tioner. Considering the results on the 4 4 lattices less typical, we conclude that 
these techniques lead to a factor of 3 to 4 improvement in the computational 
effort required to invert the overlap operator. Improvements tend to be even 
better for more demanding computations, i.e. when \i becomes smaller. Our 
approach comes on top of the more classical eigenvector projection-techniques, 
to which it contributes its improvements in a multiplicative manner. 

Using the techniques outlined in this paper (and anticipating further improve- 
ments), we hope to be able to run HMC algorithms using overlap fermions on 
moderate lattice sizes on the next generation of cluster computers. There are 
some subtleties when running the HMC algorithm with overlap functions, the 
most notable being that the derivative of the sign function generates a step 
function in the fermionic force. We plan to discuss the Hybrid Monte-Carlo 
algorithm in more detail and present some initial results on small lattices in 
a subsequent paper. 
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A Definitions 

The Wilson-Dirac matrix reads: 

M nm = 5 nm — kDw, 
where the hopping term is defined as 

D W = XX 1 ~ lv)Unip)&n,m-n + (1 + lii) U K n ~ ^n,m+n- (A.l) 

n 

k is the hopping parameter, which is defined as k = 1/(8 — 2m ), where 
m is the Wilson mass. The hermitian Euclidean 7 matrices satisfy the anti- 
commutation relation {7«,7j} = 25ij i,j = 1, ...4. 75 is the product 75 = 
71727374, which means that {75,7^} = 0. The hermitian form of the Wilson- 
Dirac matrix is given by multiplication of M with 75: 

Q = 75 M. (A.2) 



B Massive Overlap Operator 

Following Neuberger [44], one can write the massive overlap operator as 

DM = c ((1 + fj) + (1 - A t) 75 sign(Q)) . 

The normalisation c can be absorbed into the fermion renormalisation, and 
will not contribute to any physics. For convenience, we have set c = 1. Thus, 
the regularising parameter p as defined in (1) is related to fi by 

p =(! + //)/(!-//). (B.l) 
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The mass of the fermion is given by 

2/im 

where Z m is a renormalisation factor. 

Another form of the massive overlap operator, which sometimes appears in 
the literature (e.g. in [14]), is 

D u = m+(m - -m)(l + 75 sign(Q)). 

This is equivalent to the formula which we use, with \i = m/(2m ). 



C Projection of low lying eigenvectors 



It is advantageous to project out the low lying eigenvectors of the Wilson 
operator Q when calculating the sign function [18] (see Table C.l). Let the 
eigenvalues of Q 2 be contained in [a, b]. The smallest eigenvalues of Q can be 
projected out of the sign function and out of the multi-mass inversion used to 
calculate the rational fraction [16]. We take the Zolotarev approximation with 
respect to a domain [—Vb, — Va 7 ] U [Va 7 , Vb] where a' > a. Beforehand, we 
compute a set A of the n p smallest eigenvalues of Q and partition A = A x U A 2 
where A 2 contains those eigenvalues which are larger in modulus than a'. If 
ip x are the normalised eigenvectors of the eigenvalues A with respect to which 
we project, then 



sign(Q)x = sign(Q) Ix- ^ V^V) + H sign(A)^ A (^\ x). 
V AeAi / AeAi 

This shows that in order to compute sign(Q)x we need the Zolotarev approx- 
imation only on the range [— Vb, — Va 7 ] U [Va 7 , Vb]- 

The projection approach in the subsequent multi-mass solver is to solve 



Q^ + r.V y = (^Q 2 + ry ly- + 

' V A A.- / 

AGA 2 a' A 

where y = x- EagA! *P X (^ X , x). 
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n p Inversion Calls to Wilson op. Eigenvalue calculation Total time 



1 


9144 


1032172 





9144 


10 


1269 


189514 


111 


1380 


20 


796 


112862 


118 


914 


30 


568 


78548 


172 


740 


40 


459 


63566 


274 


733 


50 


387 


52758 


361 


748 


60 


340 


45732 


410 


750 



Table C.l 

The times (in seconds) needed to calculate one relGMRESR(CG) inversion of the 
overlap operator, and to calculate n p eigenvalues of the Wilson operator for different 
values of n p , on the 8 4 configuration 1 with fi = 0.1. 

n p Inversion Calls to Wilson op. Eigenvalue calculation Total time 



1 


131 


13112 


18 


149 


10 


30 


4860 


14 


44 


20 


24 


3532 


22 


46 


30 


19 


2874 


31 


50 


40 


17 


2474 


60 


77 



Table C.2 

The times (in seconds) needed to calculate one relGMRESR(CG) inversion of the 
overlap operator, and to calculate n p eigenvalues of the Wilson operator for different 
values of n p , on the 4 4 configuration 1, with n = 0.1. 

This eigenvector projection improves the condition number of the inversion, 
and therefore the CG method will converge faster. Note that projecting all 
computed eigenvalues directly out of the sign function would allow us to use 
a larger lower bound a 1 for the Zolotarev expansion which will speed up the 
calculation further. However, this comes with an additional cost when calcu- 
lating the fermionic force, and our preference is to only use this method for 
exceptionally small eigenvalues. Furthermore, in order to satisfy detailed bal- 
ance, we need to use the same Dirac operator throughout the calculation, i.e. 
we are forced to keep a' fixed. 

The eigenvectors have to be calculated every time the gauge field is updated. 
In an HMC algorithm this means that the time taken for each micro-canonical 
step is the sum of the time taken for the calculation of the eigenvectors and 
the time needed for the inversion of the overlap operator (the calculation 
of the remainder of the fermionic force is negligible). Some fine tuning of 
n p , the number of eigenvectors projected out, is therefore required. We used 
an Arnoldi algorithm with Chebyshev improvement to calculate the lowest 
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eigenvalues of the squared Wilson operator [43]. From Tables C.l and C.2, we 
can see that there is a factor of 3 gain in using the eigenvalue projection on 
the smaller lattices, and there is a large factor of 12 on the larger lattices. 



D Hybrid Monte-Carlo with Overlap Fermions 

D.l The Zolotarev approximation. 

We approximate the action of the matrix sign function sign(Q) = / y 5 M(M ji M)~^ 
on a vector y using the Zolotarev rational approximation. This (Zoo) best ap- 
proximation r(A) to sign(A) on [—Vb, —y/a] U [y/a, \/b] is given as 4 [50] 

s ig n(^KA)^. - ggl±g , (D.!) 

where the coefficients are constructed using elliptic integrals as 

sn 2 (jK/(2N z )-K) 



K 



' ~ 1-s^(jK/(2N z );k) 
i dt 



o ^(1 - t 2 )(l - « 2 t 2 ) 



D is uniquely defined via the relation 

max (1 — V\r(X)) = — min (1 — \Z\r 

The rational function r(A) in (D.l) can equivalently be represented by its 
partial fraction expansion 



1 Nz (\ \- x 
KA)=^-A-g^(-V + Tj ) 

where 



u _ Ul=i(c2j-i - c 2k ) 

Uk^j,k=l( C 2j-l - c 2k-l) 
Tj = C 2 j-l 



An alternative form of the Zolotarev expansion can be found in [15]. 
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Using this representation, we approximate the action of the matrix sign func- 
tion on a vector y as 



sign(Q)y ~ —=Q UjAjy, where Aj = (\Q 2 + Tjl 



Herein, the y j = Ajy are calculated using the multi-shift CG method [23,39] 



iteration we set the order of the Zolotarev expansion to be Nz = 25, which 
gave the sign function accurate to machine precision when the multi-mass 
solver was calculated to machine precision. When we needed less precision in 
the relaxed methods, we stopped the multi-mass solver earlier, see [53]. When 
evaluating the sign function in a preconditioner, where we only required an 
accuracy of 10 -1 , we could reduce the order of the expansion to N z = 5. 



D.2 The fermionic force. 



The fermionic part of the Hybrid Monte-Carlo action is given by 



The fermionic force needed for the Hybrid Monte-Carlo algorithm at a lattice 
site x and direction \x is 



as a multi-mass inverter for the systems 




S pf =^X- X = (H N )- 2 <j). 
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F,(x) = (1 - /i 2 ) (Xt 75 ) n (F* nm (b) + F^Jx) + F* nm (x)) X n 
(1 - f)Xl (F^Jx) + F^ nm (x) + F* nm (x)) ( 75 X) m 



F R (r) 



-Qbxl5(l - l^x+^cQcd — 



— K 



r fc7 5(l - ^)S b ,J x+ ,, d ] A k de (l - |^ A ) (^ A |) em 
( 1 ^Q^fc 



F s (r) 



\y/aQ 2 + T k/ 

KP\nxlh sign(A)(l - 7 M )5 :EiC+eM ( ip x ) 

V ' \ ' TLX 



+ 



(k A > (^ A IL sign(A) (^ A L 75(1 - K c 

(1 — |-0 A > <-0 A |)(^ — A)- 1 (l — |-0 A > <^ A |)- 



We assume summations over all repeated indices, including sums over all the 
projected eigenvectors. Note that the fermionic force contains a delta func- 
tion in the smallest eigenvalue. This means that if the smallest eigenvalue 
changes sign during the molecular dynamics of the Hybrid Monte-Carlo, then 
some care needs to be taken when calculating the fermionic force. We will 
discuss this matter fully, and present our solution to the problem, in a future 
publication [16] (see also [21]). 

In order to calculate the fermionic force, we need to perform two multi-mass 
inversions of the Wilson operator and one inversion of the squared overlap 
operator (e.g. by using relGMRESR(CG)). As discussed during this paper, it 
is this second step which is time-consuming. We also need to calculate S p f 
during the Monte-Carlo process, and for this we require just a single inversion 
of the overlap operator (e.g. by using relGMRESR(SUMR)). 
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